After tidying our data set, we conduct an exploratory analysis to
look for possible predictors of the Attendance
outcome.
A brief summary of attendance by the Type variable
is provided below:
theme_park |>
  group_by(Year, Type) |>
  # Rescale so the table reports attendance in units of 100,000
  mutate(
    Attendance = Attendance / 100000
  ) |>
  # Total attendance per Year and Type; dropping the grouping silences the summarise() message
  summarise(sum = sum(Attendance), .groups = "drop") |>
  arrange(Type) |>
  # One row per Year, one column per facility Type
  pivot_wider(
    names_from = Type,
    values_from = sum
  ) |>
  knitr::kable(digits = 3, caption = "Summary of Attendance for Three Types of Facilities From 2019 to 2022")
| Year | Amusement/Theme Park | Museum | Water Park |
|---|---|---|---|
| 2019 | 37996.4 | 20100.8 | 5898.9 |
| 2020 | 13031.1 | 4664.5 | 2313.5 |
| 2021 | 22463.7 | 6459.0 | 3473.5 |
| 2022 | 21280.8 | 11603.3 | 4678.3 |
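From this table, some observed patterns are: attendance fell sharply in 2020 for all three facility types; amusement/theme parks had the highest attendance in every year, followed by museums and then water parks; attendance recovered partially in 2021 and 2022, but no type had returned to its 2019 level by 2022; and amusement/theme park attendance was slightly lower in 2022 than in 2021, while museum and water park attendance kept rising.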
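One way to quantify these changes is to express each year's total as a percentage of the 2019 total for the same facility type. The sketch below is only an illustration: it re-enters the totals from the summary table by hand in base R rather than recomputing them from `theme_park`, and the names `attendance`, `types`, `baseline`, and `pct_of_2019` are introduced here and are not part of the original analysis.

```r
# Totals copied from the summary table above (Attendance / 100000).
attendance <- data.frame(
  Year       = 2019:2022,
  theme_park = c(37996.4, 13031.1, 22463.7, 21280.8),
  museum     = c(20100.8, 4664.5, 6459.0, 11603.3),
  water_park = c(5898.9, 2313.5, 3473.5, 4678.3)
)

# Columns holding the three facility types (hypothetical short names).
types <- c("theme_park", "museum", "water_park")

# 2019 totals serve as the baseline for each type.
baseline <- unlist(attendance[attendance$Year == 2019, types])

# Divide every column by its 2019 value and convert to percent.
pct_of_2019 <- sweep(as.matrix(attendance[, types]), 2, baseline, "/") * 100
rownames(pct_of_2019) <- attendance$Year

round(pct_of_2019, 1)
```

Reading across the 2022 row of the result shows how far each type had recovered relative to its pre-2020 level.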
The distribution of attendance by year is visualized in the box plots below:
theme_park |>
  group_by(Year) |>
  plot_ly(y = ~Attendance, color = ~Year, type = "box", colors = "viridis") |>
  layout(
    annotations = list(
      x = 1, y = 1, text = "Plot 1: Distribution of Attendance by Year",
      showarrow = FALSE, xref = "paper", yref = "paper",
      xanchor = "right", yanchor = "auto", xshift = 0, yshift = 0,
      font = list(size = 15)
    )
  )
Next, we look at how mean Attendance changed
from 2019 to 2022 within each level of the Region variable.
theme_park |>
  group_by(Region, Year) |>
  # Mean attendance per Region and Year; dropping the grouping silences the summarise() message
  summarise(attend_mean = mean(Attendance), .groups = "drop") |>
  # "markers" draws points; "point" is not a valid plotly scatter mode
  plot_ly(x = ~Year, y = ~attend_mean, color = ~Region,
          type = "scatter", mode = "markers", colors = "viridis") |>
  layout(
    annotations = list(
      x = 1, y = 1, text = "Plot 2: Change in Attendance for Each Region",
      showarrow = FALSE, xref = "paper", yref = "paper",
      xanchor = "right", yanchor = "auto", xshift = 0, yshift = 0,
      font = list(size = 15)
    )
  )
theme_full =
read_csv("ultimate data.csv")
## Rows: 920 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Park_Name, City, Country, Type, Region
## dbl (2): Year, Attendance
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Looking at Plot 1, we found a large number of outliers, so we carried out a further analysis of the top 25 theme parks, which are labeled as Worldwide in the Region variable.
We use the “Worldwide” subset to compare the total number of visitors to the top 25 theme parks across countries. From the chart, we observe:
The U.S. has the largest total number of visitors from 2019 to 2022, followed by China, with Japan in third place.
Russia, Germany and Spain have the lowest total number of visitors from 2019 to 2022.
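A country ranking of this kind could be produced by keeping only the "Worldwide" rows, summing Attendance per country, and sorting the totals. The base R sketch below is an assumption about how that might be done, not the code behind the original chart. The data frame `toy` and every value in it are hypothetical placeholders that only mimic the column names of `theme_full`.

```r
# Hypothetical toy rows with some of the columns of theme_full; all values are made up.
toy <- data.frame(
  Park_Name  = c("Park A", "Park B", "Park C", "Park A", "Park B", "Park C"),
  Country    = c("Country X", "Country X", "Country Y",
                 "Country X", "Country X", "Country Y"),
  Region     = c("Worldwide", "Worldwide", "Worldwide",
                 "Worldwide", "Europe", "Worldwide"),
  Year       = c(2019, 2019, 2019, 2020, 2020, 2020),
  Attendance = c(100, 50, 120, 40, 20, 30)
)

# Keep only the rows labeled "Worldwide" (the top 25 list).
top25 <- toy[toy$Region == "Worldwide", ]

# Sum attendance per country across all years.
country_totals <- aggregate(Attendance ~ Country, data = top25, FUN = sum)

# Sort from the largest total to the smallest.
country_totals <- country_totals[order(-country_totals$Attendance), ]
country_totals
```

Applied to the real data, the same three steps would use `theme_full` in place of `toy`.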
The first ANOVA test focuses on the Type variable in our
data set. The null hypothesis and alternative hypothesis are listed as
follows:
\[H_0: \mu_{\text{Amusement/Theme Park}} = \mu_{\text{Water Park}} = \mu_{\text{Museum}} ~~ \text{vs} ~~ H_1: \text{at least two means are not equal}\]
anova_1 = aov(Attendance ~ Type, data = theme_park)
summary(anova_1)
##              Df    Sum Sq   Mean Sq F value Pr(>F)
## Type          2 1.127e+17 5.635e+16   105.3 <2e-16 ***
## Residuals   737 3.944e+17 5.351e+14
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
With a p-value of less than 2e-16, we reject the null
hypothesis. We have evidence that at least two of the means are not
equal; that is, mean attendance differs between at
least two of the groups in the Type variable.
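As a quick check on the ANOVA table, the F statistic is the ratio of the Type mean square to the residual mean square. The degrees of freedom follow from the three facility types (\(k = 3\)) and the \(N = 740\) observations implied by the table's degrees of freedom (\(2 + 737 + 1\)):

\[F = \frac{MS_{\text{Type}}}{MS_{\text{Residuals}}} = \frac{5.635 \times 10^{16}}{5.351 \times 10^{14}} \approx 105.3, \qquad df_{\text{Type}} = k - 1 = 2, \qquad df_{\text{Residuals}} = N - k = 740 - 3 = 737\]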
The second ANOVA test focuses on the Year variable in
our data set. The null hypothesis and alternative hypothesis are listed
as follows:
\[H_0: \mu_{\text{2019}} = \mu_{\text{2020}} = \mu_{\text{2021}}= \mu_{\text{2022}} ~~ \text{vs} ~~ H_1: \text{at least two means are not equal}\]
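dat =
  theme_full |>
  # Drop the "Worldwide" rows (the top 25 list)
  filter(Region != "Worldwide") |>
  # Treat Year as a categorical factor for the one-way ANOVA
  mutate(Year = as.factor(Year))

# Fit the model, then print the ANOVA table
anova_2 = aov(Attendance ~ Year, data = dat)
summary(anova_2)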
With a p-value of less than 2e-16, we reject the null hypothesis. We have evidence that at least two of the means are not equal; that is, mean attendance differs between at least two of the years.